Analysis of Embeddings¶

Table of contents

  • Analysis of Embeddings
    - Loading Embeddings
    - Loading Data Partitions
    - Merging content, title and label with embedding vectors
    • Visualizing Labeled data
      • MiniLM Embedding Space
        • PCA of MiniLM
      • mpnet-base Embedding Space
        • PCA of mpnet-base

This notebook explores the patterns that dimensionality reduction techniques can reveal within the proposed embedding models.

To analyze the embedding spaces we use Arize's Phoenix app, which projects the high-dimensional vectors into a 3-dimensional space using UMAP. Additionally, we look at each embedding space in a PCA-reduced representation to see how much impact the choice of reduction algorithm has.

We looked at the performance of the following two embedding models:

  • all-MiniLM-L6-v2
  • all-mpnet-base-v2
In [ ]:
import os
import sys

import pandas as pd
import plotly
import plotly.express as px

from dotenv import load_dotenv

# Make the project root importable so that modules under src/ can be used.
current_dir = os.getcwd()
parent_dir = os.path.dirname(current_dir)

sys.path.append(parent_dir)

plotly.offline.init_notebook_mode()
load_dotenv()
Out[ ]:
True

Loading Embeddings¶

The first step is to load the persisted embeddings of each embedding model. The embedding vector matrices were saved under the data/embeddings/* folder, one per split of the initial train, test and validation partitioning.

The following block loads these per-split matrices into memory.
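
As a point of reference, the pickles were presumably produced along these lines; the SentenceTransformer call is real, but the surrounding variable names and the exact output paths are assumptions:

In [ ]:
# Sketch only - how the matrices under data/embeddings/ were presumably created.
# `split_df` stands for one of the partition DataFrames loaded further below.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # analogously 'all-mpnet-base-v2'
vectors = model.encode(split_df['content'].tolist(), show_progress_bar=True)
pd.to_pickle(vectors, 'data/embeddings/mini_lm/labelled_dev.pkl')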

In [ ]:
DATA_DIR = os.getenv('DATA_DIR', 'data')
EMBEDDING_DATA_DIR = os.path.abspath(os.path.join(parent_dir, DATA_DIR, 'embeddings'))

embedding_model_dirs = [d for d in os.listdir(EMBEDDING_DATA_DIR) if os.path.isdir(os.path.join(EMBEDDING_DATA_DIR, d))]
embeddings = {}

# Load the pickled embedding matrices of every model, keyed by split name.
for model_dir in embedding_model_dirs:
    print(f"- Opening Embeddings from {model_dir}")
    curr_embeddings = {}
    for file in os.listdir(os.path.join(EMBEDDING_DATA_DIR, model_dir)):
        if file.endswith('.pkl'):
            filename = file.split('.')[0]
            curr_embeddings[filename] = pd.read_pickle(os.path.join(EMBEDDING_DATA_DIR, model_dir, file))
            print(f"  - Read {file}")
    embeddings[model_dir] = curr_embeddings
- Opening Embeddings from mpnet_base
  - Read validation_set.pkl
  - Read unlabelled_dev.pkl
  - Read labelled_dev.pkl
- Opening Embeddings from mini_lm
  - Read validation_set.pkl
  - Read unlabelled_dev.pkl
  - Read labelled_dev.pkl

Loading Data Partitions¶

To get a fuller picture of the embedding space's attributes, we want to look at the content of each review and how it relates to the other reviews in the space. The next code block therefore gathers the nominal attributes of each review by loading the three split parquets (train, test and validation, or in our case unlabelled, labelled and validation) into memory.

In [ ]:
PARTITIONS_DATA_DIR = os.path.abspath(os.path.join(parent_dir, DATA_DIR, 'partitions'))

partitions = {}

for file in os.listdir(PARTITIONS_DATA_DIR):
    if file.endswith('.parquet'):
        filename = file.split('.')[0]
        partitions[filename] = pd.read_parquet(os.path.join(PARTITIONS_DATA_DIR, file))
        print(f'- Read {file}')
- Read labelled_dev.parquet
- Read validation_set.parquet
- Read unlabelled_dev.parquet

Merging content, title and label with embedding vectors¶

The goal now is to merge the matrix representations with the dataframes. This will later allow us to pass Phoenix a dataset that contains each review's content alongside its embedding vector.

In [ ]:
import numpy as np

merged_partitions = {}

for embedding_model in embeddings:
    print(f'Merging partitions for model {embedding_model}')
    merged_list = []
    
    for partition_key, partition_df in partitions.items():
        # Match a partition to its embedding matrix via the shared name prefix
        # (e.g. 'validation' for 'validation_set').
        curr_partition_name = partition_key.split('_')[0]
        matched = False
        
        for embedding_key, embedding_array in embeddings[embedding_model].items():
            if curr_partition_name == embedding_key.split('_')[0]:
                if isinstance(embedding_array, (list, pd.Series)):
                    embedding_array = np.array(embedding_array)
                
                if len(partition_df) == embedding_array.shape[0]:
                    partition_df = partition_df.copy()
                    partition_df['embedding'] = embedding_array.tolist()
                    merged_list.append(partition_df)
                    matched = True
                    print(f"  - Merged {embedding_key} with {partition_key}")
                else:
                    print(f"  - Number of rows do not match for {embedding_key} and {partition_key}")
        
        if not matched:
            print(f"  - No matching embedding found for {partition_key}")
    
    if merged_list:
        merged_partitions[embedding_model] = pd.concat(merged_list, ignore_index=True)
    else:
        print(f"No partitions were merged for model {embedding_model}")
Merging partitions for model mpnet_base
  - Merged labelled_dev with labelled_dev
  - Merged validation_set with validation_set
  - Merged unlabelled_dev with unlabelled_dev
Merging partitions for model mini_lm
  - Merged labelled_dev with labelled_dev
  - Merged validation_set with validation_set
  - Merged unlabelled_dev with unlabelled_dev

Visualizing Labeled data¶

To project the high-dimensional embeddings into a humanly readable format we use Arize's Phoenix app, which lets us interactively explore the embedding space projected down to 3 dimensions by UMAP.
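
The create_dataset and launch_px helpers live in src.px_utils and are not shown in this notebook. As rough orientation, here is a minimal sketch of what they presumably wrap using Phoenix's schema API; the helper internals are an assumption, and depending on the Phoenix version px.Dataset may be named px.Inferences:

In [ ]:
# Sketch of the assumed src.px_utils internals - not the actual implementation.
import phoenix as px

def create_dataset_sketch(name, df):
    # Tell Phoenix which column holds the vectors and which holds the raw text;
    # 'text_embedding' is the link name that shows up inside the app.
    schema = px.Schema(
        embedding_feature_column_names={
            'text_embedding': px.EmbeddingColumnNames(
                vector_column_name='embedding',
                raw_data_column_name='content',
            )
        }
    )
    return px.Dataset(dataframe=df, schema=schema, name=name)

def launch_px_sketch(primary, reference=None):
    return px.launch_app(primary=primary, reference=reference)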

Additionally, it might be insightful to look at a different dimensionality reduction approach. We therefore wrote the plot_pca function below, which projects the embedding space into three dimensions using PCA.

In [ ]:
from sklearn.decomposition import PCA

def break_content(text, length=50):
    """Wrap long review text with <br> tags so the Plotly hover labels stay readable."""
    lines = []
    while len(text) > length:
        space_index = text.rfind(' ', 0, length)
        if space_index == -1:
            space_index = length
        lines.append(text[:space_index])
        text = text[space_index:].lstrip()
    lines.append(text)
    return '<br>'.join(lines)


def plot_pca(merged_partitions, key):
    """Reduce the embedding vectors of one model to 3 dimensions with PCA and plot them."""
    if key not in merged_partitions:
        raise ValueError(f"Key {key} not found in the merged_partitions dictionary.")

    df = merged_partitions[key]

    embedding_matrix = np.vstack(df['embedding'].values)
    content = df['content'].apply(lambda x: break_content(x)).values

    pca = PCA(n_components=3)
    reduced_embeddings = pca.fit_transform(embedding_matrix)

    pca_df = pd.DataFrame(reduced_embeddings, columns=['PCA1', 'PCA2', 'PCA3'])
    pca_df['Content'] = content

    fig = px.scatter_3d(pca_df, x='PCA1', y='PCA2', z='PCA3',
                        title=f'PCA of Embedding Vectors for {key}',
                        size_max=5, opacity=0.6, height=800,
                        hover_data={'Content': True})
    fig.update_traces(marker=dict(size=2))
    fig.show()

MiniLM Embedding Space¶

First, we take a look at the embedding space of the mini_lm embedding model.

Note: to see the projected space in the Phoenix app, make sure to click the "text_embedding" link inside the app; this loads the 3-dimensional UMAP projection. Also note that UMAP uses stochastic algorithms to speed up the computation, so this decomposition approach is non-deterministic and the representation you see may not match the one we describe.
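
If a reproducible projection is needed, the standalone umap-learn package accepts a fixed seed. A minimal sketch, assuming umap-learn is installed:

In [ ]:
# Deterministic 3D UMAP projection via a fixed seed (this disables parallelism).
import umap

embedding_matrix = np.vstack(merged_partitions['mini_lm']['embedding'].values)
reducer = umap.UMAP(n_components=3, random_state=42)
projection = reducer.fit_transform(embedding_matrix)  # shape: (n_reviews, 3)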

In [ ]:
from src.px_utils import create_dataset, launch_px

mini_lm_ds = create_dataset('mini_lm', merged_partitions['mini_lm'], merged_partitions['mini_lm']['embedding'], content=merged_partitions['mini_lm']['content'])

px_session = launch_px(mini_lm_ds, None)
px_session.view()
converting items in column `embedding` to numpy.ndarray, because they have the following type: list
❗️ The launch_app `port` parameter is deprecated and will be removed in a future release. Use the `PHOENIX_PORT` environment variable instead.
❗️ The launch_app `host` parameter is deprecated and will be removed in a future release. Use the `PHOENIX_HOST` environment variable instead.
🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix
📺 Opening a view to the Phoenix app. The app is running at http://localhost:6006/
Out[ ]:

The embedding positions in the space form relatively clear clusters. In this first example we are using the embedding vectors of the all-MiniLM-L6-v2 sentence transformer, a BERT-based model from Hugging Face. This model was trained on sentence pairs (many of them Q&A-style), so the resulting embedding describes the semantic content of a sentence. This is exactly what we see in the embedding space: reviews of the amazon-polarity dataset are clustered together according to their product niche, for example the cluster containing music reviews. A quick cosine-similarity check, sketched after the list below, makes this semantic relatedness concrete.

But we can also observe other semantic relationships:

  • Video game reviews lie between music and book reviews: this axis could describe interactivity; music can be enjoyed passively, games involve some interaction between cutscenes, while books capture one's concentration and attention entirely.
  • Video game reviews lie opposite tech gadgets and other devices: this axis might describe the degree of virtuality; games are purely virtual, while tech gadgets are physical devices.
  • Kids' toys are clustered between games and tech gadgets.
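
As referenced above, one quick way to make these semantic-relatedness claims concrete is to compare cosine similarities between review-like sentences directly. A minimal sketch, assuming the sentence-transformers package is installed; the three snippets are invented for illustration, not samples from the dataset:

In [ ]:
# Illustrative only - the snippets below are invented, not dataset samples.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')
snippets = [
    "This album is fantastic, every track is a hit.",  # music
    "A gripping novel I could not put down.",          # book
    "The charger broke after two weeks of use.",       # tech gadget
]
vecs = model.encode(snippets, normalize_embeddings=True)
print(vecs @ vecs.T)  # pairwise cosine similarities (rows are unit-normalized)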

PCA of MiniLM¶

Since Phoenix does not allow choosing a different dimensionality reduction technique, we implemented a PCA strategy ourselves. UMAP differs vastly from PCA, so looking at a second technique could yield additional observations about the embedding space. PCA, moreover, is deterministic, so its output is reproducible and easier to reason about.
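
One advantage of PCA over UMAP is that it quantifies how much variance each component captures. A quick check on the mini_lm space, using the column and key names defined above:

In [ ]:
# Share of the embedding variance explained by the first three principal components.
X = np.vstack(merged_partitions['mini_lm']['embedding'].values)
pca = PCA(n_components=3).fit(X)
print(pca.explained_variance_ratio_)        # per-component share
print(pca.explained_variance_ratio_.sum())  # total variance captured by the 3D plot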

In [ ]:
plot_pca(merged_partitions, 'mini_lm')

Compared to the UMAP representation, the PCA reduction shows a roughly triangular embedding space with clusters emerging towards its corners. We can still roughly identify the following four clusters:

  • Music albums
  • Books
  • Movies
  • Tech Gadgets

This visualization again supports the claims made in the analysis above: the all-MiniLM-L6-v2 model clearly succeeds in embedding and clustering the reviews according to their semantic relatedness.
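
To move beyond visual inspection, the apparent separation can also be quantified. A hedged sketch using k-means with k=4 (the cluster count we identified by eye) and the silhouette score; note this measures geometric separation only, not semantic correctness:

In [ ]:
# Quantify cluster separation in the full-dimensional mini_lm embedding space.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.vstack(merged_partitions['mini_lm']['embedding'].values)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
print(silhouette_score(X, labels))  # closer to 1 means better-separated clusters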

mpnet-base Embedding Space¶

Now we look at the embedding space of the all-mpnet-base-v2 model.

In [ ]:
mpnet_base_ds = create_dataset('mpnet_base', merged_partitions['mpnet_base'], merged_partitions['mpnet_base']['embedding'], content=merged_partitions['mpnet_base']['content'])

px_session = launch_px(mpnet_base_ds, None)
px_session.view()
converting items in column `embedding` to numpy.ndarray, because they have the following type: list
Existing running Phoenix instance detected! Shutting it down and starting a new instance...
❗️ The launch_app `port` parameter is deprecated and will be removed in a future release. Use the `PHOENIX_PORT` environment variable instead.
❗️ The launch_app `host` parameter is deprecated and will be removed in a future release. Use the `PHOENIX_HOST` environment variable instead.
🌍 To view the Phoenix app in your browser, visit http://localhost:6006/
📺 To view the Phoenix app in a notebook, run `px.active_session().view()`
📖 For more information on how to use Phoenix, check out https://docs.arize.com/phoenix
📺 Opening a view to the Phoenix app. The app is running at http://localhost:6006/
Out[ ]:

The UMAP projection of mpnet_base, as seen in Phoenix, shows roughly the same clusters as the UMAP projection of the mini_lm embedding space. The main and most obvious observations stay the same as noted in the previous exploration of mini_lm's decomposition:

  • One cluster that separates itself from the other points in the space is the music-related cluster
  • On the other side of the space, many data points seem to be about books

PCA of mpnet-base¶

In [ ]:
plot_pca(merged_partitions, 'mpnet_base')

This mpnet_base PCA projection shows a decomposed space similar to the PCA of the mini_lm embedding model. This observation makes sense because both embedding models were trained with similar BERT-style objectives focused on mapping and clustering the semantic meanings of sentences. Consequently, the decomposed spaces map similar variances onto the principal components.

A confusion that might arise when comparing both 3D PCA plots is that principal component #2 appears to be flipped. This does not change the component's meaning, since principal components derived from PCA are unique only up to a sign flip: the eigenvectors of a covariance matrix (which define the principal components) can point in either direction along the axis they define. Both directions represent the same principal component, just with inverted signs.
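
A small numerical sketch on synthetic data confirms this: negating a principal component together with its scores leaves the PCA reconstruction unchanged.

In [ ]:
# Sign-flip invariance of PCA, demonstrated on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))

pca = PCA(n_components=2).fit(X)
scores = pca.transform(X)

# Negate PC2 and its scores simultaneously.
components = pca.components_.copy()
components[1] *= -1
scores_flipped = scores.copy()
scores_flipped[:, 1] *= -1

recon = scores @ pca.components_ + pca.mean_
recon_flipped = scores_flipped @ components + pca.mean_
print(np.allclose(recon, recon_flipped))  # True - same component, same reconstruction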